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Efficient SDP Inference for Fully-connected CRFs 
Based on Low-rank Deeomposition 

Peng Wang, Chunhua Shen, Anton van den Hengel 


Abstract 

Conditional Random Fields (CRF) have been widely used in a variety of computer vision tasks. Conventional 
CRFs typically define edges on neighboring image pixels, resulting in a sparse graph such that efficient inference 
can be performed. However, these CRFs fail to model long-range contextual relationships. Fully-connected CRFs 
have thus been proposed. While there are efficient approximate inference methods for such CRFs, usually they 
are sensitive to initialization and make strong assumptions. In this work, we develop an efficient, yet general 
algorithm for inference on fully-connected CRFs. The algorithm is based on a scalable SDP algorithm and the low- 
rank approximation of the similarity/kernel matrix. The core of the proposed algorithm is a tailored quasi-Newton 
method that takes advantage of the low-rank matrix approximation when solving the specialized SDP dual problem. 
Experiments demonstrate that our method can be applied on fully-connected CRFs that cannot be solved previously, 
such as pixel-level image co-segmentation. 


I. Introduction 

Semantic image segmentation or pixel labeling is a key problem in computer vision. Given an image, the task 
is to label every pixel against one or multiple pre-defined object categories. It is clear that to achieve satisfactory 
results, one must exploit contextual information. Scalability and speed of the algorithm are also of concerns, if we 
are to design an algorithm applicable to high-resolution images. 

Conditional random fields (CRFs) have been one of the most successful approaches to semantic pixel labeling, 
which solves the problem as maximum a posteriori (MAP) estimation. Standard CRFs contain unary potentials that 
are typically defined on low-level features of local texture, color, and locations. Edge potentials, which are typically 
defined on 4- or 8-neighboring pixels, consist of smoothness terms that penalize label disagreement between similar 
pixels, and terms that model contextual relationships between different classes. Although these CRF models have 
achieved encouraging results for segmentation, they fail to capture long-range contextual information. 

In the literature, fully-connected CRFs have been proposed for this purpose. The main challenge for inference 
on fully-connected CRFs stems from the computational cost. A fully-connected CRF over N image pixels has 
edges. Even for a small images with a few thousand pixels, the number of edges can be a few million. Although 
there have been a variety of methods for MAP estimation lUl, iU, Q, @|, lEl, 0, lCl> El; ifT^ . they are 
usually computationally infeasible for such cases. The authors of ifTTl . ifl^ have proffered an efficient mean field 
approximation method for MAP inference in multi-label CRF models with fully connected pairwise terms. A filter- 
based method is used to accelerate the computation. The assumption is that the pairwise terms are in the form 
of a weighted mixture of Gaussian kernels such that fast bilateral filtering can be applied. For a special type of 
fully-connected CRF, in which the edge potentials are defined to capture the spatial relationships among different 
objects, and only depend on their relative positions (that is they are spatially stationary), an efficient inference 
algorithm was developed in iHCll . The method proposed in |[T4ll can be applied on generalized RBF kernels, instead 
of the original Gaussian kernels. Note that there is still a strong assumption in ifldl . which limits the practical value 
of this method. 

In general, semidefinite programming (SDP) relaxation provides accurate solutions for MAP estimation problems, 
but it is ususally computationally inefficient (see |0 for the comparison of different relaxation methods). Standard 
interior-point methods require 0{rn^+mn^+ 171 ^ 11 ?) flops to solve a generic SDP problem in worst-case, where n and 
m are the semidefinite matrix dimension and the number of constraints respectively. Recently, several scalable SDP 
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X A matrix (bold upper-case letters). 

X A column vector (bold lower-case letters). 

5" The space of n x n symmetric matrices. 

5+ The cone of n x n symmetric positive semidefinite (SPSD) matrices. 

R" The space of real-valued n x 1 vectors. 

R" ,R" The non-negative and non-positive orthants of R". 

In The n X n identity matrix. 

0 An all-zero vector with proper dimension. 

1 An all-one vector with proper dimension. 

<, > Inequality between scalars or element-wise inequality between column vectors. 
diag(X) The vector of the diagonal elements of the input matrix X. 

Diag(x) The n x n diagonal matrix whose main diagonal vector is the input vector x. 

trace(') The trace of a matrix. 
rank( ) The rank of a matrix. 

S{cond) The indicator function which returns 1 if cond is ture and 0 otherwise. 

II'IIf Frobenius-norm of a matrix. 

(•, •) Inner product of two matrices, 
o Hadamard product of two matrices. 

(g) Kronecker product of two matrices. 

Vf(-) The first-order derivative of function f(-). 

V^f(-) The second-order derivative of function f(-). 
n! The factorial of a non-negative integer n. 


TABLE I: Notation. 


methods have been proposed for MAP estimation. Huang et al. |[T5l proposed an alternating direction methods of 
multipliers method (ADMM) to solve large-scale MAP estimation problems. Wang et al. |[T6l presented an efficient 
dual approach (refer to as SDCut), which can also be applied for MAP estimation. However, their methods still 
cannot be applied directly to large-scale fully-connected CRFs. 

There are two key contributions in this work: 

(z) An efficient low-rank SDP approach (based on SDCut) for MAP estimation is proposed for MAP estimation 
in large-scale fully-connected CRFs. Several significant improvements over SDCut are presented, which makes 
SDCut much more scalable. The proposed SDP method also overcomes a number of limitations of mean field 
approximation, which provide more stable and accurate solutions. 

{ii) Low-rank approximation methods for SPSD kernels (whose kernel matrix is symmetric positive semidefinite) is 
seamlessly integrated into the proposed SDP method, and used to accelerate the most computational expensive part 
of the proposed SDP method. The use of low-rank approximation relaxes the limitation on the pairwise term from 
being (a mixture of) Gaussian kernels to all symmetric positive-semidefinite kernels. The low-rank approximation 
method can be also used to replace the filter-based method in ifTTI for mean field approximation. 

Thus our method is much more general and scalable, which has a much broader range of applications. The 
proposed SDP approach can handle fully-connected CRFs of 7 )^:states x ^variables up to 10®. In particular, we 
show that on an image co-segmentation application, the fast method of lITTll is not applicable while our method 
achieves superior segmentation accuracy. To our knowledge, our method is the first pixel-level co-segmentation 
method. All previous co-segmentation methods have relied on super-pixel pre-processing in order to make the 
computation tractable. Wang et al. ifTTII and Frostig et al. iflSl also proposed efficient approaches which find near- 
optimal solutions to SDP relaxation to MAP problems. The main difference is that their methods solve (generally 
nonconvex) quadratically constrained quadratic programs by projected gradient descent, while ours uses quasi- 
Newton methods to solve a convex semidefinite least-square problem. Notation is listed in Table 


H. Preliminaries 


A. Fully-connected Pairwise CRFs with SPSD kernels 

Consider a random field over N random variables x = [xi, X 2 , • • •, conditioned on the observation I. Each 
variable can be assigned a label from the set £ := The energy function of a CRF (I, x) can be expressed 

by the following Gibbs distribution: 


P(x|I) 


1 

W) 


exp(—E(x|I)) 


( 1 ) 
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where E(x|I) denotes the Gihhs energy function w.r.t. a labelling x G and Z{1) := XlxeC" exp(“E(x|I)) is 
the partition function. In the rest of the paper, the conditioning w.r.t. I is dropped for notational simplicity. 

Assuming E(x) only contains unary and pairwise terms, the MAP inference problem for the CRF (I, x) is 
equivalent to the following energy minimization problem: 

min E(x) := jiiixj) + ^ Tpij{xi,Xj), (2) 

i£jV i,jGjV,i<j 

where J\f := {1 ,... ,N}, and : £ —)• M, Vi G A/" and —)• M, Vi,y G M, i j correspond to the unary and 

pairwise potentials respectively. 

The pairwise potentials considered in this paper can be written as: 

M 

i;ij{xi,Xj) := n{xi,Xj) ^ f,), (3) 

m=l 

where fj,fj G indicate D-dimentional feature vectors corresponding to variables Xi and Xj respectively, : 

X —)• M denotes the m-th SPSD kernel and G M+ is the associated linear combination weight. 

Following the term in ifTTl . — )• [0,1] is used to represent a symmetric label compatibility function, which 

has the properties that G £ and = 0, V/ G £. The label compatibility function penalizes 

similar pixels being assigned with different/incompatible labels. A simple label compatibility function would be 
given by Potts model, that is = 6{l / I'). The form of pairwise potential in Q is very general, and can be 

used to represent many potentials of practical interest. 

Mean field approximation is used in ifTTI for solving problem Q, which is considered to be state-of-the-art. A 
filter-based method is used to accelerate the computation of message passing step (the product of kernel matrix and 
a column vector). In the following two sections, we will briefly revisit mean field approximation and the filter-based 
method, especially their respective limitations. 

B. Mean Field Approximation 

In mean field approximafion, a variational disfribufion Q{'x.) is infroduced fo approximafe fhe Gibbs disfribufion 
P(x), in which fhe marginals for each variable in CRF, supposed fo be independenf fo each ofher 

such fhat (5(x) can be completely factorized as 

Q{-k) = Ui(zj^Qi{xi). ( 4 ) 

Over the distribution Q(x), mean field approximation minimizes the KullbackLeibler (KL)-divergence D(Q||P): 

D(Q||P)= Q(x)log||^ (5a) 

= ^ Q(x}logQ(x) - ^ Q(x)log^exp(-E(x)) (5b) 

xe£~ xe£« 

= XI X Q(x)E(x)-f logZ (5c) 

xe£« xe£« 

Recall the definition of the energy function for fully-connected pairwise CRFs in Q and the complete factorization 
Q, we further have that 

= E E Qi(xi) log Qi(xi) + EE Qi(Xi)^i(Xi) 

i&M Xi&C iSjV Xi&C 

+ E E Qiixi)Qj{xj)'ilJij{xi,Xj) + logZ, ( 6 ) 

Xi,Xj&C 

To optimize over the f-th marginal, the KL-divergence D(Q||P) is viewed as a function of Qi{-) while keeping 
other marginals fixed: 

D(Q||F) = y; Qi{Xi) log QiiXi) + X f '4’iixi) + E E Qj{xj)'ilJij{xi,Xj)] -fconsf. (7) 

Xi&C Xi&C 
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It is easy to find out that minimizing D(Q||P) w.r.t. Qi{-) gives the following close-formed solutions, namely mean 
field equations'. 

= Jr exp f ^ '^Qj{xj)'il>ij{xi,Xj)j,\/ieJ\f, (8) 

where Zi is the local normalization factor such that Xlx ~ Updating the above mean field equations 

iferafively resulfs in a monofonically decreased D((5||P). 

One significanf limifafion of mean field approximafion is fhaf if may converge fo one of potentially many local 
opfima, because fhe variafional problem fo be opfimized may be non-convex. A consequence of Ibis non-convexify 
is fhaf mean field is oflen sensifive fo fhe inifializafion of Q. 


C. Filter-based Matrix-vector Product 

Recall fhe pairwise ferms defined by SPSD kernels as in Q, fhe mean field equations ([^ can be further expressed 
as: 


* I'&C m=l 


( 9 ) 


The compufafional boffleneck in updafing fhe above equafion can be expressed as fhe mafrix-vecfor producfs 
K(m)q^ m = I,-- - , M, where G denotes fhe kernel mafrix corresponding fo fhaf is = 

and q G denotes a column vector made up by Q, fhaf is q = [Qi(f), • • • , QatCOT’ ^ The 
naive implemenfafion of fhe mafrix-vecfor producf needs O(iV^) fime. Krahenbiihl and Kolfun lOTI proposed fo 
use a filfer-based approach fo compufe fhe mafrix-vecfor producf in 0{N) fime, which will be discussed in fhe 
nexf section. 

Filler-based mefhods |[T^ have been used in ifTTI fo speed up fhe above mafrix-vecfor producf. The melhod in 
ifm is based on fhe assumplion fhaf pairwise polenlials are Gaussian kernels: 


k(™)(fi,f,) = exp 




( 10 ) 


where G S^, m = 1, 2, • • • , M. The producf of a Gaussian kernel mafrix and an arbilrary column vector 

can be expressed as a Gaussian convolution w.r.l. A^"^^ in fealure space (see ifT^ . ifTTI for more delails). From fhe 
viewpoinl of signal processing, fhe Gaussian convolulion can be seen as a low-pass filter over fhe fealure space. 
Then fhe convolulion resull can be recovered from a sel of samples whose spacing is proportional to fhe sfandard 
devialion of fhe filler. A number of filtering mefhods ll20l . ifT^ can be used fo compute fhe convolulion efficienfly, 
in which fhe compufafional complexify and memory requiremenl are bolh linear in N. 

Filler-based approaches have a number of limilafions, however: 

(f) In general, fhe pairwise polenlials are limited fo Gaussian kernels over a Euclidean fealure space. 

(ii) The fealure dimension cannol be very high. The bilateral tillering melhod in EOll has an exponenlial complexify 
in fhe dimension D. The time complexify of permufohedral lalfice lfT9l is quadralic in D, which works well only 
when fhe inpul dimension is 5 ~ 20. Beacause if does nol creale new lalfice poinls during fhe blur step, accuracy 
penally is accumulaled wilh fhe growlh of fealure dimension. 


D. Semidefinite Programming and SDCut Algorithms 

Semidefinile programming (SDP) is a class of convex oplimizalion problems, which minimize/maximize a linear 
objeclive function over fhe inferseclion of fhe cone of positive semidefinile malrices wilh an affine space. A general 
SDP problem can be expressed in fhe following form: 

min p(Y):=(Y,A), (11a) 

Y 

s.t. (Y,Bi) = f = 1,2, • • • ,q, (11b) 

where A, {Bi}i=i^... G b G M”, q and n are positive integers denoting fhe number of linear conslrainls and 
fhe dimension of mafrix variables respecfively. 
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SDP relaxation is widely incorporated to develop approximation algorithms for binary quadratic program (BQP), 
which optimizes a quadratic objective function over binary variables y G SDP-based approximation 

algorithms typically solve the BQP in the following two steps: 

(i) Lift the binary variable y to a positive semidefinite matrix variable Y := yy^ and solve the relaxed SDP 
problem over Y to a certain accuracy. 

(ii) Round the SDP solution to obtain an approximated solution to the original BQP 

SDP problems can be solved by standard interior-point methods, which can be found in a number of optimization 
toolboxes, such as SeDuMi ||2T1, SDPT3 and MOSEK Although accurate and stable, interior-point methods 
scale poorly to the matrix dimension n and the number of linear constraints q. The computational complexity of 
interior-point methods for SDP problems is 0{m^+ mn^+ m?'n?) at each iteration in worst-case, and the associated 
memory requirement is 0 {m? + v?). 

The method proposed in ifT^ . denoted as SDCut, can be used to solve the SDP problem ( [TT] ) approximately yet 
efficiently. SDCut solves the following approximation of ( [TT] ) using quasi-Newton 

P 7 (Y) := (Y,A) + ^(||Y|||,-r/2), (12a) 

YScS" 27 

s.t. (Y,Bi) = f = 1,2, ••• ,q, (12b) 


where 7 > 0 is a penalty parameter. Assuming that the constraints ( |12b| ) encode trace(Y) = 77 , where r/ is a 
constant defined by fhe problem itself, the above approximation have the following properties: 


Proposition 1. The following results holds: (i)Ve>0, 37>0 such that |p(Y*) — p(Y*)| < e, where Y* denotes 
the optima for (HI]' and Y* denotes that for w.r.t. 7 . (ii) For 72 > 7 i > 0, we have p(Y*J > p(Y*J, where 
Y and Y are the optimal solutions of ( |12| ) for 71 and 72 respectively. 

Proof These results rely on the properties that trace(Y) = rj. See lIT^ for details. □ 

The above results show that ( fT2j ) is an accurate approximation to the problem ( [TT| |, as the solution to ( [T2| ) can 
be sufficiently close to that to ( [TT] ) given a large enough 7 . The advantage of ( [T2] ) is that it has a much simpler 
Lagrangian dual: 

Proposition 2. The Lagrangian dual problem of can be simplified to 

2 

max d^(u) :=-^||(C(u))+|||-u^b-^ (13) 

Z Z'^ 

where C(-) : —)■ 5” is defined as C(u) := —A — and (•)+ : 5** —)• 5” is defined as (Y)+ = 

rDiag(max(0,A))r^. A := [Ai,.. ., F stand for the respective eigenvalues and eigenvectors ofY, that 

is Y = rDiag(A)r^. The relationship between the optimal solution to the primal (0 Y* and the solution to the 
dual ( [T3] ) u* is: Y* = 7 (C(u*))_|_. 

Proof See IT6l for details. □ 

It is easy to find fhaf fhe Lagrangian dual problem ( [T3] ) is convex, and fhe p.s.d. matrix variable is eliminated in 

the dual. It is also proved in ifT^ that the simplified dual problem has fhe following nice properties: 


Proposition 3. Vu G V 7 > 0 , d-),(u) yields a lower-bound on the optimal objective function value of the problem 

Proposition 4. d(-) is continuously differentiable but not necessarily twice differentiable, and its gradient is given 
by 

Vd.,(u) = -7 [((C(u))^ , Bi), • • • , ((C(u))^ , B,)f - b. (14) 

Such that Wang et al. ifT^ adopted quasi-Newton methods to solve the dual problem ( [T3] ). At each iteration 
of quasi-Newton methods, only the objective function d-,, and its gradient ( [T4] ) need to be computed, where 
the computational bottleneck is on the calculation of (C(u))^, which is equivalent to obtaining all the positive 
eigenvalues and the corresponding eigenvectors of C(u). Note that although the SDP problem discussed in this 
paper contains only linear equality constraints, the SDCut method and the proposed method can both easily extended 
to SDP problems with linear inequality constraints. 
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III. Matrix-VECTOR Product based on Low-rank Approximation 


One key contribution of this paper is the use of a low-rank approximation to the positive semidefmite kernel 
matrix, based on which low-rank quasi-Newton methods are developed for large-scale SDP CRF inference. We 
propose to approximate an SPSD kernel matrix K G S!^ by a low-rank representation: K « where $ G 

■^NxRk ^ qP computational complexity and memory requirement for computing the 

aforementioned matrix-vector product are linear in N. Compared to ll20ll . ifT^ . the pairwise potential function is 
generalized to any positive semidefinite kernel function and there is no restriction on the input feature dimension. 

The best quality can be achieved by a low-rank approximation depends on the spectral distribution of the kernel 
matrix itself, which is related to the smoothness (differentiability and Lipschitz continuity) of the underlying kernel 
function (see d, na, m, d, d, am for more details). In general, eigenvalues of smooth kernels (e.g. 
Gaussian kernel) decay quickly and thus can be well approximated by low-rank matrices. 

The optimal low-rank approximation in terms of both the spectral norm and Frobenius norm can be obatined by 
eigen-decomposition, while it is computationally inefficient whose computational complexity is generally cubic in 
N. There are a number of low-rank approximation methods achieving linear complexity in N, including Nystrdm 
methods ll30l . lISTI . lim . incomplete Cholesky decomposition |[33l, ll34l . random Fourier features ||35l, (Ml, and 
homogeneous kernel maps liTTll . For detailed discussion, please refer to the review papers |[38l . lim . BOl . We adopt 
Nystrom methods ll37ll in this paper for the low-rank approximation of kernel matrices. 

Nystrom methods can be used to approximate a positive semidefintie matrix K G S^, by sampling i?o <C iV 
columns of K (refer to as landmarks). Firstly K is expressed as: 


w K2,1^ 

K2,1 K2,2 


(15) 


where W G S^° denotes the intersection of the sampled Rq columns and rows. The matrix K 2,2 F <5^ can be 
approximated as: 


K2,2 ~ K2,iTr'S^^T\K2/, 


(16) 


where R < Rq and YIr = Diag([Ai,..., A/j]^). Ai > A 2 > • • • > A/^ > 0 are the /^-largest eigenvalues of W 
and F/j contains the corresponding (column) eigenvectors. Note that is the best rank-i? approximation 

to W. Then we have a rank-i? approximation to K: 


K 


W 

K2,1 


FrS 


R^r 


(\ ^ 

UK2,1 


FrE 


1 

2 


R 



(17) 


which is proved to have a bounded error to the optimal rank-7? approximation given by the eigen-decomposition ll^ . 

There are several strategies to sample representative landmarks, i.e., columns of K, including the standard uniform 
sampling ISTI . non-uniform sampling |[32l and fe-means clustering lldTl . In this paper, we adopt the fc-means method 
in II 4 TII to select landmarks. At each round of fe-means, only R columns of K, rather than the entire matrix K, is 
required to be instantiated. 

Note that for Nystrom methods, the positive semidefinite matrix K to be approximated can be any m-th kernel 
matrix or the summation 


IV. SDP Relaxation to MAP Estimation Problems 

In this section, we introduce SDP relaxation to the problem Q. Throughout the main body of this paper, the 
label compatibility function is assumed to be given by Potts model, that is = 5{l I'). The SDP relaxation 

corresponding to an arbitrary label compatibility function is discussed in Section |VIII-A| 

By defining X G H G and K G 5^ as Xi^i = 5{xi = 1), = 'fi{l) and Kij = 
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fi)^ the objective function of Q can be re-written as: 

M 

iSA/” i,j&N,i<j m=l 

= ^ 'lpi{l)d{Xi = l)+ X 

= X = 0 - ^ X X ^ X 

i£jV,leC i,j&Nl,l'&C i,j&M 

= (h,x)-^(xx'",k) + Xtki. 

Such that the problem Q can be expressed as the following binary quadratic problem (BQP): 

min E(X) := (H, X) - -(XX”^, K) 
xe{o,i}«x^ ^ ^ ^ ^ 2^ ' ' 

s-t- YA=iXi,i = 1, Vi G M, 


(18a) 

(18b) 

(18c) 

(18d) 

(19a) 

(19b) 


Note that there is a one-to-one correspondence between the set of x G and the set of X G {0,1}^'^^^ satisfying 
( |19b| ), and E(x) = E(X) -|- ^l^Kl for equivalent x and X. 


By introducing Y := 


II 

X 


II 

X 


T 


the corresponding SDP relaxation to problem ([T9l) can be expressed as: 


min 

Yg^N+i 


S.t. 


— 1) I € C, 

^{ni>+Yi,,i) = o, i<i',i/ gc, 
2 ^f=i0^i+L,l + = 1, i G Af, 

^-l-L,i+L — 1) i £ ■ 


(20a) 

(20b) 

(20c) 

(20d) 

(20e) 


Clearly we have trace(Y) = N + L which is implicitly encoded by the linear constraints, and rank(Y) = L which 
is non-convex and dropped by the SDP relaxation. 

In the above formulation, the objective function and all the constraints (|20b|), (|20ci, (20d i, (20el are linear 


in Y. Therefore the problem 
g = 2iV + L(L + l)/2, A = ^ 


0 

H 


can be re-written in the general form of o, in which n = rj = N + L, 


H 


-K 


and thus solved using SDCut lIT^ . 


V. Low-rank Quasi-Newton Methods For SDP Inference 

In this section, we follow the method in m (denoted as SDCut) which solves general BQPs. Several major 
improvements are proposed to make SDCut scalable to the large-scale energy minimization problem & which is 
another key contribution of this paper. 

Although it is shown in iTT^ that SDCut already runs much faster than standard interior-point methods, there are 
still several issues to be addressed for the problem to be solved in this work: 

(i) It is shown in lIT^ that rank((C(u))_|_) drops significantly in the first several iterations, and Lanczos methods ll42l 
can be used to efficiently compute a few leading eigenpairs. However, because (C(u))_g is not necessarily low-rank 
in the initial several iterations, much of time may be spent on the first several eigen-decompositions. In the CRFs 
considered in this paper, there are up to 681,600 variables. Using the original SDCut method, the time spent on 
the first several iterations can be prohibitive. 

{ii) In general, a BFGS-like method has a superlinear convergence speed under the condition that the objective 
function is twice continuously differentiable. However, the dual objective function ( [T3] ) is not necessarily twice 
differentiable. So the convergence speed of SDCut is unknown. In practice, SDCut usually needs more than 100 
iterations to converge. 

In the next two sections, we introduce two improvements to the SDCut method, which address the above two 
problems and increase the scalability of SDCut significantly. The improved method is refer to as LR-SDCut and 
its procedure is summarized in Algrithm [T] 
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Algorithm 1 LR-SDCut algorithm for MAP estimation. 

Input: A, {Bi}i=i.2, ,q,_b, 7, Pfmax, r > 0, r < 

Initialization: = 0, E* = +inf, A = A — t^Ijv where v is the r-th smallest eigenvalue of A. 

for fe = 0, 1 , 2,. .., /fmax do 

StepI: - pHVd.^(u(''^), where H is updated to approximate (V^dT,(u^*’^)) ^ and 0 < p < 1 is the step size. 

Step2: = Round(7(C(u<'=+b))+). 

Step3: If E(X(''+b) < E*, X* = X^^+b. 

Step4: Exit, if (d.,(u('=+b) - d.,(u«))/max{|dT,(u(''+b)|, |d.,(u('=))|,l} < r. 

Output: X*, E*. 


A. A Low-rank Initial Point 

If the initialization of the dual variable is 0, then we have C(u(°)) = —A. Without affecting the optimal 
solution to ( [TT| ), A can he perturbed so as to reduce rank((C(u*^‘^^))+) to a small integer, based on: 

(i) For Y G 5” n {trace(Y) = n}, (Y, A + ul„) = (Y, A) + vn. So the matrix A in the problem ( [TT] ) can be 
equivalently replaced by A + ul„, Vu / 0. 

(ii) Suppose that A 7 ^ 0 and x G M"' is an eigenpair of A G S'^, i.e., Ax = Ax, then A + ul„ has an eigenpair: 
A + u and x, Vu ^ 0. 

To decrease the rank of (C(u(‘’)))^ to r <C n, we can equivalently replace A by A — ul„, where a is the 
r-th smallest eigenvalue of A. 


B. Rounding Schemes and Early Stop 

Traditionally, a feasible solution X to the BQP problem ( fT^ is obtained by rounding the optimal solution Y* to the 
corresponding SDP formulation ( [T^ . The rounding procedure will be carried out until the quasi-Newton algorithm 
converges. In contrast, we perform the rounding procedure on the non-optimal solution Y^^^ := at 

each iteration k of the quasi-Newton algorithm (Step2 in Algorithm [T]). In practice, we find that the dual objective 
value of ( [T^ , i.e. the lower-bound to the optimal value of E(X), increases dramatically in the first several iterations. 
Simultaneously, the value of E(X(^)) also drops significantly for the first several ks. This observation inspires us 
to stop the quasi-Newton algorithm long before convergence, without affecting the final solution quality. 

In this work, we adopt the random rounding scheme proposed in ll43l to derive X from Y^^^ := 7 (C(u 
Note that because Y^^) is positive semidefinite, it can be decomposed to Y^^) = where G and 

Ry = rankjY^^^). The rounding scheme can be expressed in the following two steps: 

(i) Random Projection: X = ’J^P, where P G and each entry Pij is independently sampled from the 

standard Gaussian distribution with mean 0 and variance 1, i.e., Pi A(0,1). 

(ii) Discretization: Obtain X G {0,1}^^^ by discretizing the above X, that is, = S{Xi^i > Xip , MV £ C, I' ^ 1). 


C. Computational Complexity and Memory Requirement 


The computational bottleneck of LR-SDCut is the eigen-decomposition of C(u) at each iteration, which is 
performed by Lanczos methods ll^ in this paper. Lanczos methods only require users to implement the matrix- 
vector product C(u)d = —Ad — RiBj)d, where d G M” denotes a so-called “Lanczos vector” produced 

by Lanczos algorithms iteratively. In this section, we will how to accelerate the computation of this matrix-vector 
product by utilizing the specific sfrucfures of A and ... g, and then give the computational cost and memory 

requirement of LR-SDCut. 


For the problem ( |20l ), A 


1 

2 


0 

H 


-K 


, and - ,q have specific strucfures such that 


uiBi = 


Diag(ui) 

1 


LTri(u2) 


U3 ( 


2 = 1 


lujr (g. 1 

Diag(u4) 


( 21 ) 


where ui G U2 G , U 3 ,U 4 G denote the respective dual variables w.r.t. constraints (|20b|l, (|20c|). 


( |20d| ), ( |20e[ ) and such that u = [u[, u^, Ug, u^]^. LTri(u):M^)^ produces an L x L symmetric matrix 
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whose lower triangular part is made up of the elements of the input vector u E that is LTri(u) = 

r 0 if i = j 

< if i> j . Then the matrix-vector product C(u)d can he expressed as: 

t — — i if f ^ j 


C(u)d 


H^da 


Hdi - Kda 

V«. 


U 1 odi + iLTri(u2)di -|- |(ujd2)l 


1 ( 1 ^ 


di)u 3 -I- U 4 o d 2 


Ad:0{NL+NRK) 


(E-=i“iB,)d:0(L2-PAr) 


( 22 ) 


where di G M^,d 2 G and such that d = [dj,d^]^. Accordingly, the computational cost of solving ( |20l ) hy 
LR-SDCut at each descent iteration, that is the complexity of eigen-decomposition of C(u), is: 

o(^{N + L)Ry+ {NRk + NL + L'^) ^ X #Lanczos-Iters, (23) 

matrix-vector product ( | 22 ] ) 

'-V-' 

Lanczos factorization 

and the memory requirement is 0{N{L + Ry + Rr) + LRy), where Rr and Ry denotes the rank of K and 
(C(u))+ respectively. Note that the computational complexity is linear in the number of CRF variables N, which 
is the same as mean field approximation. 


VI. Applications 

To show the superiority of the proposed method, we evaluate it and other methods on two applications in this 
section: image segmentation and image co-segmentation. In the following our experiments, the maximum number 
of iterations fFniax for LR-SDCut is set to 10; the initial rank r is set to 20; and the penalty parameter 7 is set to 
1000 . 


A. Application 1: Image Segmentation 

Following the work in ifTTll . pairwise potentials for image segmentation are expressed in the following form: 


= exp 


iPi - Pjl 

202 



(24) 


where p* and c* are the position and color value of pixel i respectively, and similarly for pj and Cj. The matrix 
defined in ( |24l ) corresponds to the appearance kernel which penalizes the case that two adjacent pixels with similar 
color and different labels. The label compatibility function is given by the Potts model p{l,l') = 6{I ^ I'). 

The kernel matrix can be decomposed to the hadamard product of two independent kernel matrices: = 

o where =exp ^ and A:c^^(fi,fj) =exp 

Nystrom methods are performed on and individually: Kp^^ ss ~ where 

$p G and Then we have: 

= (K^ oK^jd (25a) 

= diag (^$p$pDiag(d)$c^l) (25b) 

= (($p$; (Diag(d)$e)) o 1. (25c) 

This computation requires 0{NRcRp) operations (i?c and Rp are set to 20 and 10 respectively). Performing Nystrom 
on Kp^^ and separately instead of on directly brings two benifits: (f) the memory requirement is reduced 
from RcRp to Rc + Rp\ (ii) For multiple images with the same resolution, we only need to perform Nystrom on 
once, as the input features (positions pj, f = 1, • • • , N) are the same. 

The improved Nystrom method iHH is adopted to obtain the low rank approximation of and Kp^^. As in 
m, K-means clustering is used to select representative landmarks. 

Experiments The proposed algorithm is compared with mean field on MSRC 21-class database. The test data 
are 93 representative images with accurate ground truth provided by lIlTll . The unary potentials are also obtained 
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Original images Ground truth Unary MF+filter MF+Nys. LR-SDCut 


Fig. 1: Qualitative results of image segmentation. Original images and the corresponding ground truth are shown in the first two columns. 
The third column demonstrates the segmentation results based only on unary terms. The results of mean field methods with different matrix- 
vector product approaches are illustrated in the fourth and fifth columns. Our methods achieves similar visual performance with mean field 
methods. 



Unary 

MF-l-filter 

MF-nNys. LR-SDCut 

Time(s) 

NA 

0.29 

6.6 74 

Accu. 

0.79 

0.83 

0.83 0.83 

Energy 

1.29 • 10® 

9.79 • 10"^ 

1.15 • 10® 9.02 • 10'* 


TABLE II: Quantitative results of image segmentation. Our method runs slower than mean field methods but gives significantly lower 
energy. Unfortunately, the lower energy does not lead to better segmentation accuracy. 


from im. The parameters 6 a, OjS and are set to 60, 20 and 10 respectively. The iteration number limit for 
mean field inference is set to 20. All experiments are conducted using a single CPU with 10GB memory. As for the 
matrix-vector product in the mean field method, both the filter-based and Nystrom-based approaches are evaluated 
(refer to as MF-i-filter and MF-i-Nys. respectively). The evaluated images have around 60,000 pixels and so the 
number of MRF variables is also around 60,000 for each image. 

Fig. [T] shows the qualitative results for image segmentation. We can see that our method achieves similar results to 
the mean field approach. In Table |T^ quantitative results are demonstrated. Althgouh the computational complexity 
of mean field and our method are both linear in N, mean field is still faster than ours in this experiment. This 
is partially because the code of mean field is highly optimized using C-i-i-, while ours is unoptimized. A speed 
up is expected if our code is further optimized and parallelized. Note that the filter-based method |[T9l can be 
also incorporated into our algorithm to compute matrix-vector products, which is likely to be faster than Nystrdm 
methods but limited to Gaussian kernels in general. 

Despite the slower speed, our method achieves significantly lower energy than mean field, which means our 
method is better from the viewpoint of MAP estimation. Unfortunately, the superiority of our method in terms of 
optimization does not lead to better segmentation performance. Actually, all of the evaluated methods have similar 
segmentation accuracy. 

B. Application 2: Image Co-segmentation 

The image co-segmentation problem requires that the same object be segmented from multiple images. There are 
two optimization criteria: the color and spatial consistency within one image and the separability between foreground 
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Original images Ground truth LR-SDCut MF+Nys. SDLR SDCut 


Fig. 2: Qualitative results for image co-segmentation. Three classes of objects from MSRC datasets are used for the evaluation. Our approach 
and Mean Field (MF-l-Nys.) are performed on the original pixel-level images. Because SDLR (H and SDCut (H cannot scale up to pixel- 
level images, they are evaluated on superpixels. Our method performs best visually. We randomly repeat mean field approximation 5 times 
for each dataset and select the best result. Mean field is not stable at this task and sometimes converges to an undesirable local optimal point 
(see “tree” for example). SDLR and SDCut achieve worse results than our’s, since some image details are lost due to the use of superpixels. 


Data 

#pics 

N 

LR-SDCut MF-l-Nys. 

N 

SDLR SDCut 

Cow 

10 

681600 

1415 

1965 

6713 

9530 

307 

Sheep 

8 

545280 

1066 

2045 

5375 

6932 

583 

Tree 

9 

613440 

1137 

1490 

6026 

1090 

1316 


TABLE III: Running times for image co-segmentation. Our method is slightly faster than mean field. The number of MRF variables N 
for two groups of evaluated methods are shown in the third and sixth columns. The problems solved by our approach are much larger than 
those of SDLR and SDCut. 


and background over all images. There is no unary potentials for image co-segmentation and the pairwise potentials 
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LR-SDCut MF-rNys. 

SDLR SDCut 

Cow 

0.73(-1.59 • 10'") 0.67(-1.58 • 10'’) 

0.66 

0.69 

Sheep 

0.74(-8.07- 10^) 0.49(-6.87' 10'‘) 

0.57 

0.58 

Tree 

0.83(-2.23 • 10®) 0.65(-2.03 • 10®) 

0.66 

0.68 


TABLE IV: Segmentation accuracy (energy) of image co-segmentation. Our method and Mean field work on original pixels, while SDLR 
and SDCut work on superpixels. For all the three evaluated datasets, our method achieves the lowest energies and highest segmentation 
scores. 



Fig. 3: Rank and Energy at each iteration for co-segmentation on the “cow” data set. Both of the rank of (C(ufc))+ and the energy of 
binary solution y* decrease significantly in the first several fcs. 


are shown in the following: 


-- 


Pi I 


291 


K(2) = 


I Ci Cj 
20 ^ 


(26a) 

(26h) 


where ipij = 1 if pixels i and j locate in the same image; ipij = 0, otherwise, k > 0 is a regularization 
parameter. is a hlock-di agonal matrix, and the matrix-vector product for can he computed using the 


method descrihed in Section 
J_1lT 
N 


VI-A 


details). S2Ar = lAr — ^11’ is 


K(2) is the inter-image discriminative clustering cost matrix (see for 
the centering projection matrix, and is the kernel matrix of sift features. 


K(2) can he approximated hy a low-rank decomposition: 
inversion lemma, we have: 


where $ G 


bNxRK„ 


Based on the matrix 


k ( 2 ) = ]nN 

kN V '-/ 

decompose to and 1=0 

= (27) 

Through the above equation, the matrix-vector product for can he computed efficiently in 0 {NRk2 ) time 
(Rk 2 is set to 640 in the experiments). The pairwise potentials are not necessarily suhmodular, because entries of 
Ki2) may be negative. Note that the matrix-vector product for cannot be performed by the filter-based method 

ofsm, because may not be a Gaussian kernel. 

Experiments Three groups of images are selected from the MSRC dataset for image co-segmentation. Besides 
our approach and mean field, the SDP-based algorithms in Il4^ (denoted as SDLR) and ifT^ (denoted as SDCut) 
are also evaluated. Our method and mean field are evaluated at the original pixel level, while SDLR and SDCut 
are evaluated only on superpixels. 

The code for SDLR and SDCut is provided by authors of the original papers, where the default settings are used. 
The iteration limit for mean field is set to 100. To prevent mean field from converging to undesirable local optima, 
we randomly run the method 5 times. All experiments are conducted on a single CPU with 20GB memory. The 
intersection-over-union accuracy is used to measure the segmentation performance. 
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From the results illustrated in Fig. we see that our approach achieves much more accurate co-segmentation 
results than both SDLR and SDCut. The performance of mean field is also worse than ours. 

Table [^demonstrates the number of variables and computational time for each method. The number of variables 
for the problem solved by our method and mean field is around 100 limes larger lhan Ihose for SDLR and SDCul. 
Our approach is slighlly fasler lhan mean field, and significanlly more scalable lhan SDLR and SDCul. 

The quanlilalive performance is shown in Table IV Our approach achieves significanlly heller co-segmenlalion 
accuracy lhan all Ihe olher melhods. As for energy, our approach also produces lower energies lhan mean field. 
Empirically, we found mean field is sensilive lo inilializalion. Take “free” as example, Ihe difference is 5.3 • 10^ 
belween Ihe best and worst energy in the 5 repeats of mean field wilh random inilializalions. If we repeal mean 
field 100 limes, Ihe besl energy improves from —2.03 • 10^ lo —2.08 • 10^, bul still worse lhan ours ( — 2.23 • 10®). 

Fig. 1^ shows Ihe change of rank((C(u(^)))_|_) and E(y(^)) w.r.l. iteration k. Bolh of Ihe rank and energy drops 
quickly in Ihe lirsl several iterations. Simullaneously, Ihe lower-bound of Ihe optimal energy E(y) {i.e. Ihe dual 
objective value) increases from —8.09 • 10^ lo —4.36 • 10®. 


VII. Conclusion 

In Ihis paper, we have proposed an eflicienl, general melhod for Ihe MAP estimation of fully-connected CRFs. The 
proposed SDP approach is more slable and accurate lhan mean field approximation, which is also more scalable 
lhan previous SDP melhods. The use of low-rank approximation of Ihe kernel malrix lo perform malrix-veclor 
producls makes our approach even more eflicienl and applicable for any symmelric positive semidefinile kernel. In 
conlrasl, previous filler-based melhods assume pairwise potentials lo be based on a Gaussian or generalized RBF 
kernel. The compulalional complexily of our approach is linear in Ihe number of CRF variables. The experimenls on 
image co-segmenlalion validate lhal our approach can be applied on more general problems lhan previous melhods. 

As for fulure works, Ihe proposed melhod can be parallelized lo achieve even faster speed. The core of our 
melhod is quasi-Newlon (or gradienl descenl) and eigen-decomposilion, bolh of which can be parallelized on GPUs. 
Malrix-veclor producls, Ihe main compulalional cosl, can be implemented using CUDA function “cublasSgemm”. 

VIII. Appendix 

A. SDP formulation for an arbitrary label compatibility function 

For an arbilrary label compalibilily function /r : —)■ [0,1], wilh Ihe properties lhal p,{l,l') = //((', 1), VI,1' G £ 


and p{l,l) = 0,V/ G £, Ihe objective function of Q, E(x), can be re-writlen as follows: 

E{x) ='^ ^ p,{xi,Xj)Kij, (28a) 

i£jV i,jeJV,i<j 

= ^ 'fi{l)d{xi = 1) + ^ p{l,l')6{xi = l)6{xj = (28b) 

= h^y + ^yT ((U + 11^) ® k) y, (28c) 

= hTy + ^yT(U®K)y + ^lTKl, (28d) 

where y G {0,1}^'^, h G U G 5^ are defined as = 5{xi = 1), = '!/>,((), Mi e Af,l ^ C 

and Uip = p{l,l') — 1, V(, I' G J\f. Such lhal Ihe energy minimization problem Q can be equivalenlly reformulated 
lo Ihe following binary quadratic problem: 

min E(y) := h^y -p ^y®" (U (g) K) y, (29a) 

ye{o,i}™ 2 

s-t- YA=iy{i-i)L+i = 1, VI G M, (29b) 


Note lhal Ihere is also a one-lo-one correspondence belween Ihe set of x G £^ and the set of y G {0,1}^^ 
satisfying ( |29b ), and E(x) = E(y) -|- for equivalent x and y. 
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By defining Y := yy^, the SDP relaxation to ([29|) can be expressed as: 


min 

Ye-S™ 

s.t. 


(Y,Diag(h) + ^Ui 


K) 


(i-i)L + i, ) — 1, Vi G AA, 

(i - 1)L + I 

- (Y (i _ i)L + i, + Y (i _ 1)1, + i'_ ) = 0, yi ^ I' ,l,l' ^ C,i ^ J\f, 

^ (i- 1)L + l' (i - 1)L + I 


(30a) 

(30b) 

(30c) 


and we have trace(Y) = N due to constraints ( 30b i. The non-convex constraint rank(Y) = 1 is dropped by the 
above SDP relaxation. There are 1 constraint ( |30b[ ) and L{L — l)/2 constraints P0c[ ) for each i G M. The problem 
(301 can also be expressed in the form of ( [TT] ), and solved by SDCut algorithm. In this case, n = NL, rj = N, 
q = N + NL{L-l)/2, A = |U(8)K, and 

LTri(u2,i) ■■■ 0 


UiBi = Diag(ui) (g) II + 


2=1 


LTri(u2,jv) . 


(31) 


where ui G denotes the dual variables w.r.t. the constraints ( 30b i, and U 2 ,j G corresponds to the 

constraints ( |30c| ) for each i G Af. We also have u = [u[, uj • • • , 

Then the matrix-vector product C(u)d, Vd G can be computed as: 


C(u)d = -(hod+ lr(K[di, • • • , dArfU 


Ad-. 0 {NLRk+NL'^) 

(ui (g) 1) o d + ^ djLTri(u 2 ,i),..., d]vLTri(u 2 ,Ar) 




(32) 


where d is decomposed as d := [dj, d^,..., di,-- - ,d 7 v G and T : is defined as 

T(X) = • • • , Xi^L,X 2 ^i, • • • , X 2 ^l, ■ ■ ■ j Atv Given that K has an i?A-rank approximation and (C(u))_|_ 

has the rank of Ry, the overall computational complexity of solving ( |30l ) using SDCut at each iteration is 

o(^NLRy+ [NLRk + NL^) X #Lanczos-Iters. (33) 

matrix-vector product ( [^ 

'-V-' 

Lanczos factorization 


The corresponding memory requirement is 0{NLRy L'^ + NRk)- Note that the above formulation still need to 
be further validated by experiments. 
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